43 research outputs found

    Refined Complexity of PCA with Outliers

    Principal component analysis (PCA) is one of the most fundamental procedures in exploratory data analysis and is the basic step in applications ranging from quantitative finance and bioinformatics to image analysis and neuroscience. However, it is well-documented that the applicability of PCA in many real scenarios could be constrained by an "immune deficiency" to outliers such as corrupted observations. We consider the following algorithmic question about PCA with outliers. For a set of $n$ points in $\mathbb{R}^{d}$, how to learn a subset of points, say 1% of the total number of points, such that the remaining part of the points is best fit into some unknown $r$-dimensional subspace? We provide a rigorous algorithmic analysis of the problem. We show that the problem is solvable in time $n^{O(d^2)}$. In particular, for constant dimension the problem is solvable in polynomial time. We complement the algorithmic result by a lower bound, showing that unless the Exponential Time Hypothesis fails, in time $f(d)n^{o(d)}$, for any function $f$ of $d$, it is impossible not only to solve the problem exactly but even to approximate it within a constant factor. Comment: To be presented at ICML 201
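
    The question posed in this abstract can be made concrete on a toy instance. The sketch below is not the paper's algorithm: it is a plain brute force over all outlier sets, restricted (as an assumption, to stay dependency-free) to points in the plane ($d = 2$) and a line through the origin ($r = 1$). The function names `line_fit_cost` and `best_outliers`, and the outlier budget `k`, are hypothetical.

```python
from itertools import combinations
from math import sqrt


def line_fit_cost(points):
    """Sum of squared distances from 2-D points to the best line through the origin.

    This equals the smallest eigenvalue of the 2x2 matrix [[a, b], [b, c]]
    built from the second moments of the points.
    """
    a = sum(x * x for x, _ in points)
    b = sum(x * y for x, y in points)
    c = sum(y * y for _, y in points)
    smallest = (a + c) / 2 - sqrt(((a - c) / 2) ** 2 + b * b)
    return max(0.0, smallest)  # guard against tiny negative rounding errors


def best_outliers(points, k):
    """Try every set of k points as outliers; keep the set with the best fit."""
    best_cost, best_set = float("inf"), ()
    for outliers in combinations(range(len(points)), k):
        kept = [p for i, p in enumerate(points) if i not in outliers]
        cost = line_fit_cost(kept)
        if cost < best_cost:
            best_cost, best_set = cost, outliers
    return best_cost, best_set


# Four points on the line y = x plus one corrupted observation.
points = [(1, 1), (2, 2), (3, 3), (-1, -1), (5, -4)]
cost, outliers = best_outliers(points, k=1)
print(cost, outliers)
```

    The brute force examines every choice of `k` outliers among the input points, so it is only meant to illustrate the objective, not the running times stated in the abstract.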

    Parameterized complexity of PCA

    We discuss some recent progress in the study of Principal Component Analysis (PCA) from the perspective of Parameterized Complexity.

    Consistency-Checking Problems: A Gateway to Parameterized Sample Complexity

    Recently, Brand, Ganian and Simonov introduced a parameterized refinement of the classical PAC-learning sample complexity framework. A crucial outcome of their investigation is that for a very wide range of learning problems, there is a direct and provable correspondence between fixed-parameter PAC-learnability (in the sample complexity setting) and the fixed-parameter tractability of a corresponding "consistency checking" search problem (in the setting of computational complexity). The latter can be seen as generalizations of classical search problems where instead of receiving a single instance, one receives multiple yes- and no-examples and is tasked with finding a solution which is consistent with the provided examples. Apart from a few initial results, consistency checking problems are almost entirely unexplored from a parameterized complexity perspective. In this article, we provide an overview of these problems and their connection to parameterized sample complexity, with the primary aim of facilitating further research in this direction. Afterwards, we establish the fixed-parameter (in)-tractability for some of the arguably most natural consistency checking problems on graphs, and show that their complexity-theoretic behavior is surprisingly different from that of classical decision problems. Our new results cover consistency checking variants of problems as diverse as (k-)Path, Matching, 2-Coloring, Independent Set and Dominating Set, among others.

    Parameterized k-Clustering: Tractability Island

    In k-Clustering we are given a multiset of n vectors X subset Z^d and a nonnegative number D, and we need to decide whether X can be partitioned into k clusters C_1, ..., C_k such that the cost sum_{i=1}^k min_{c_i in R^d} sum_{x in C_i} |x-c_i|_p^p <= D, where |*|_p is the Minkowski (L_p) norm of order p. For p=1, k-Clustering is the well-known k-Median. For p=2, the case of the Euclidean distance, k-Clustering is k-Means. We study k-Clustering from the perspective of parameterized complexity. The problem is known to be NP-hard for k=2 and it is also NP-hard for d=2. It is a long-standing open question whether the problem is fixed-parameter tractable (FPT) for the combined parameter d+k. In this paper, we focus on the parameterization by D. We complement the known negative results by showing that for p=0 and p=infty, k-Clustering is W[1]-hard when parameterized by D. Interestingly, the complexity landscape of the problem appears to be more intricate than expected. We discover a tractability island of k-Clustering: for every p in (0,1], k-Clustering is solvable in time 2^{O(D log D)} (nd)^{O(1)}.
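
    The cost in this abstract can be evaluated directly for a fixed partition. The following is a minimal sketch for the two named cases only, p=1 (k-Median) and p=2 (k-Means), relying on the standard fact that |x-c|_p^p splits into per-coordinate terms, so the optimal center is the coordinate-wise median for p=1 and the coordinate-wise mean for p=2. It evaluates a given partition and does not search for one; the function names are hypothetical.

```python
from statistics import fmean, median


def cluster_cost(cluster, p):
    """min over centers c of sum_{x in cluster} |x - c|_p^p, for p in {1, 2}."""
    if p not in (1, 2):
        raise ValueError("this sketch only handles p = 1 and p = 2")
    total = 0.0
    for coords in zip(*cluster):  # iterate over coordinates, one at a time
        center = median(coords) if p == 1 else fmean(coords)
        total += sum(abs(v - center) ** p for v in coords)
    return total


def clustering_cost(clusters, p):
    """Total cost of a partition C_1, ..., C_k, as in the abstract's formula."""
    return sum(cluster_cost(c, p) for c in clusters)


def within_budget(clusters, p, D):
    """Check whether this particular partition certifies a yes-instance."""
    return clustering_cost(clusters, p) <= D


clusters = [[(0, 0), (0, 0), (3, 0)], [(5, 5)]]
print(clustering_cost(clusters, 1))  # k-Median cost of this partition
print(clustering_cost(clusters, 2))  # k-Means cost of this partition
```

    On this toy partition the first cluster has median center (0, 0) and mean center (1, 0), which is why the two objectives give different costs for the same clusters.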

    On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

    Fair clustering is a constrained clustering problem where we need to partition a set of colored points. The fraction of points of each color in every cluster should be more or less equal to the fraction of points of this color in the dataset. The problem was recently introduced by Chierichetti et al. (2017) [1]. We propose a new construction of coresets for fair clustering for Euclidean and general metrics based on random sampling. For the Euclidean space R^d, we provide the first coreset whose size does not depend exponentially on the dimension d. The question of whether such constructions exist was asked by Schmidt et al. (2019) [2] and Huang et al. (2019) [5]. For general metrics, our construction provides the first coreset for fair clustering. New coresets appear to be a handy tool for designing better approximation and streaming algorithms for fair and other constrained clustering variants.

    Building Large k-Cores from Sparse Graphs

    A popular model to measure network stability is the k-core, that is, the maximal induced subgraph in which every vertex has degree at least k. For example, k-cores are commonly used to model the unraveling phenomena in social networks. In this model, users having fewer than k connections within the network leave it, so the remaining users form exactly the k-core. In this paper we study the question of whether it is possible to make the network more robust by spending only a limited amount of resources on new connections. A mathematical model for the k-core construction problem is the following Edge k-Core optimization problem. We are given a graph G and integers k, b and p. The task is to ensure that the k-core of G has at least p vertices by adding at most b edges. Previous studies on Edge k-Core demonstrate that the problem is computationally challenging. In particular, it is NP-hard when k = 3, W[1]-hard when parameterized by k+b+p (Chitnis and Talmon, 2018), and APX-hard (Zhou et al., 2019). Nevertheless, we show that there are efficient algorithms with provable guarantees when the k-core has to be constructed from a sparse graph with some additional structural properties. Our results are:
    - When the input graph is a forest, Edge k-Core is solvable in polynomial time.
    - Edge k-Core is fixed-parameter tractable (FPT) when parameterized by the minimum size of a vertex cover in the input graph. On the other hand, with such parameterization, the problem does not admit a polynomial kernel subject to a widely-believed assumption from complexity theory.
    - Edge k-Core is FPT parameterized by the treewidth of the graph plus k. This improves upon a result of Chitnis and Talmon by not requiring b to be small.
    Each of our algorithms is built upon a new graph-theoretical result interesting in its own right.
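
    The unraveling process described in this abstract corresponds to the standard peeling procedure for computing a k-core: repeatedly delete vertices with fewer than k remaining neighbors. The sketch below shows that procedure only, not any of the paper's algorithms for Edge k-Core. It assumes an undirected simple graph given as a dictionary of neighbor sets; the name `k_core` is hypothetical.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected simple graph.

    `adj` maps each vertex to the set of its neighbors (assumed symmetric).
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed through an earlier stack entry
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(k_core(graph, 2))  # the pendant vertex unravels

# Adding the single edge 0-3 (budget b = 1) enlarges the 2-core.
augmented = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(k_core(augmented, 2))
```

    The second call illustrates the Edge k-Core question on a tiny instance: one added edge is enough to bring all four vertices into the 2-core.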

    On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

    Fair clustering is a constrained variant of clustering where the goal is to partition a set of colored points, such that the fraction of points of any color in every cluster is more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced by Chierichetti et al. [NeurIPS, 2017] in a seminal work and became widely popular in the clustering literature. In this paper, we propose a new construction of coresets for fair clustering based on random sampling. The new construction allows us to obtain the first coreset for fair clustering in general metric spaces. For Euclidean spaces, we obtain the first coreset whose size does not depend exponentially on the dimension. Our coreset results solve open questions proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The new coreset construction helps to design several new approximation and streaming algorithms. In particular, we obtain the first true constant-approximation algorithm for metric fair clustering, whose running time is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first $(1+\epsilon)$-approximation algorithm for fair clustering whose time complexity is near-linear and does not depend exponentially on the dimension of the space. Besides, our coreset construction scheme is fairly general and gives rise to coresets for a wide range of constrained clustering problems. This leads to improved constant-approximations for these problems in general metrics and near-linear time $(1+\epsilon)$-approximations in the Euclidean metric.

    Manipulating Districts to Win Elections: Fine-Grained Complexity

    Gerrymandering is a practice of manipulating district boundaries and locations in order to achieve a political advantage for a particular party. Lewenberg, Lev, and Rosenschein [AAMAS 2017] initiated the algorithmic study of a geographically-based manipulation problem, where voters must vote at the ballot box closest to them. In this variant of gerrymandering, for a given set of possible locations of ballot boxes and known political preferences of $n$ voters, the task is to identify locations for $k$ boxes out of $m$ possible locations to guarantee victory of a certain party in at least $l$ districts. Here the integers $k$ and $l$ are selected parameters. It is known that the problem is NP-complete already for 4 political parties, and prior to our work only heuristic algorithms for this problem were developed. We initiate the rigorous study of the gerrymandering problem from the perspectives of parameterized and fine-grained complexity and provide asymptotically matching lower and upper bounds on its computational complexity. We prove that the problem is W[1]-hard parameterized by $k+n$ and that it does not admit an $f(n,k)\cdot m^{o(\sqrt{k})}$ algorithm for any function $f$ of $k$ and $n$ only, unless the Exponential Time Hypothesis (ETH) fails. Our lower bounds hold already for $2$ parties. On the other hand, we give an algorithm that solves the problem for a constant number of parties in time $(m+n)^{O(\sqrt{k})}$. Comment: Presented at AAAI-2
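
    The problem statement in this abstract can be illustrated by a naive checker that tries every choice of $k$ box locations. This is not the paper's algorithm. Several modelling details below are assumptions that the abstract does not fix: voters and boxes are points in the Euclidean plane, a voter equidistant from two boxes goes to the one listed first, and a party wins a district only with strictly more votes than every other party. The names `districts_won` and `can_win` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def districts_won(voters, boxes, party):
    """Count districts won by `party` when every voter uses the closest box.

    `voters` is a list of (position, preferred_party) pairs.
    Assumptions: Euclidean distance, ties in distance go to the first box,
    and winning a district requires a strict plurality.
    """
    tallies = [Counter() for _ in boxes]
    for position, preference in voters:
        nearest = min(range(len(boxes)), key=lambda i: dist(position, boxes[i]))
        tallies[nearest][preference] += 1
    won = 0
    for tally in tallies:
        ours = tally[party]
        if ours > 0 and all(ours > v for p, v in tally.items() if p != party):
            won += 1
    return won


def can_win(voters, locations, k, party, l):
    """Brute force: is there a set of k locations giving `party` >= l districts?"""
    return any(
        districts_won(voters, [locations[i] for i in chosen], party) >= l
        for chosen in combinations(range(len(locations)), k)
    )


voters = [((0, 0), "A"), ((1, 0), "A"), ((10, 0), "B"), ((11, 0), "B"), ((12, 0), "B")]
locations = [(0, 0), (11, 0), (5, 0)]
print(can_win(voters, locations, k=2, party="A", l=1))
print(can_win(voters, locations, k=1, party="A", l=1))
```

    With a single box all five voters share one district and party A is outvoted, while two boxes allow the two A voters to form their own district.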